At DEF CON 22, the FTC ran a contest to help mitigate robocalls. There were three rounds; the last used a set of call records collected from a robocall honeypot to determine whether a caller was a robocaller. See Parts I and II of the contest for details on robocaller honeypots.
The FTC gave us two sets of data, each row showing a phone call from one "person" to another along with the date and time. Both collections have been uniquely randomized, but the area code and subscriber number portions were kept the same.
This Notebook details the initial exploration of the data. For the follow-up on predictions, check out Modeling Rachel the Robocaller.
In [21]:
from IPython.display import Image
Image("http://www.ftc.gov/system/files/attachments/zapping-rachel/zapping-rachel-contest.jpg")
Out[21]:
In [24]:
%matplotlib inline
# Standard toolkits in pydata land
import pandas as pd
import numpy as np
In [2]:
# Neat little library that is a partial port of Google's libphonenumber
import phonenumbers
from phonenumbers import geocoder
# from phonenumbers import carrier
from phonenumbers import timezone
In [3]:
# First pass will use a Random Forest; more on this later
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
In [4]:
def read_FTC(dataset):
    return pd.read_csv(dataset,
                       parse_dates=["DATE/TIME"],
                       converters={'LIKELY ROBOCALL': lambda val: val == 'X'},
                       dtype={'TO': str, 'FROM': str, 'LIKELY ROBOCALL': bool}
                       )
# This assumes you have the data locally
labeled_data = read_FTC("FTC-DEFCON Data Set 1.csv")
unlabeled_data = read_FTC("FTC-DEFCON Data Set 2.csv")
In [5]:
labeled_data.head()
Out[5]:
In [6]:
unlabeled_data.head()
Out[6]:
First thing to note right off the bat: we can have the phonenumbers library parse the numbers for us and cache the parsed objects.
In [7]:
# Pulling a random number from the data set; with no leading '+' and no default
# region, this raises a NumberParseException
fake_number = phonenumbers.parse("19188765408")
In [8]:
# Looking back at their docs, a leading '+' with a region of None makes phonenumbers attempt to detect the region for us
fake_number = phonenumbers.parse("+19188765408", None)
fake_number
Out[8]:
In [9]:
fake_number.country_code
Out[9]:
In [10]:
phonenumbers.is_valid_number(fake_number)
Out[10]:
In [11]:
geocoder.description_for_number(fake_number, "EN")
Out[11]:
In [12]:
timezone.time_zones_for_number(fake_number)
Out[12]:
In [13]:
# Do they all start with a 1?
print(labeled_data["TO"].str.get(0).unique())
print(labeled_data["FROM"].str.get(0).unique())
print(unlabeled_data["TO"].str.get(0).unique())
print(unlabeled_data["FROM"].str.get(0).unique())
Yup! This means we're using the North American Numbering Plan.
The NANP is a system of numbering plan areas (NPA) using telephone numbers consisting of a three-digit area code, a three-digit central office code, and a four-digit station number. Through this plan, telephone calls can be directed to particular regions of the larger NANP public switched telephone network (PSTN), where they are further routed by the local networks. The NANP is administered by the North American Numbering Plan Administration (NANPA), a service operated by Neustar corporation. The international calling code for the NANP is 1.
Our phone number structure is then CAAAOOONNNN, where C is the country code, AAA is the area code, OOO is the "central office" code (does this come from the old operator days?), and NNNN is the station number, the remaining digits unique to a caller. The numbers have been randomized though, so we'll be ignoring NNNN as a feature on its own.
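As a quick illustration of that layout (slice boundaries assumed from the CAAAOOONNNN breakdown above, applied to the fake number used earlier):
number = "19188765408"     # C AAA OOO NNNN
country_code = number[0]   # '1'
area_code = number[1:4]    # '918'
office_code = number[4:7]  # '876'
station = number[7:]       # '5408'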
Parsing the area code and central office code is trivial with pandas string methods. There are also a few utilities in the phonenumbers library that might help in a little bit, namely:
geocoder.description_for_number
phonenumbers.is_valid_number
timezone.time_zones_for_number
In [14]:
# Let's go ahead and parse all of them
# We'll parse with a leading '+' since the numbers are listed with a leading country code,
# and leave the second argument as None so that phonenumbers has to detect the region
labeled_data["TO_PARSED"] = labeled_data["TO"].apply(lambda row: phonenumbers.parse("+" + row, None))
labeled_data["FROM_PARSED"] = labeled_data["FROM"].apply(lambda row: phonenumbers.parse("+" + row, None))
In [15]:
labeled_data["TO_VALID"] = labeled_data["TO_PARSED"].apply(lambda ph: phonenumbers.is_valid_number(ph))
labeled_data["FROM_VALID"] = labeled_data["FROM_PARSED"].apply(lambda ph: phonenumbers.is_valid_number(ph))
In [16]:
labeled_data.TO_VALID.unique()
Out[16]:
In [17]:
labeled_data.FROM_VALID.unique()
Out[17]:
There are invalid numbers in the FROM column?!? What proportion of those come from the likely robocallers?
In [18]:
from_valid_v_robocall = pd.crosstab([labeled_data.FROM_VALID], labeled_data['LIKELY ROBOCALL'])
from_valid_v_robocall.plot(kind='bar', stacked=True, grid=False, color=["blue", "red"])
from_valid_v_robocall
Out[18]:
In [25]:
from_valid_v_robocall.div(from_valid_v_robocall.sum(1).astype(float), axis=0).plot(kind='barh', stacked=True, color=["blue", "red"])
Out[25]:
Come to think of it, maybe FROM_VALID is a no-good, bad feature: the numbers were randomized by the FTC and not necessarily kept valid. Hmmm... Moving on.
While we're at it, might as well make a utility function to do our cross tabulation plots against likely robocalls.
In [26]:
def explore_feature(df, name):
    feature_v_robocall = pd.crosstab([df[name]], df['LIKELY ROBOCALL'])
    feature_v_robocall.plot(kind='bar', stacked=True, grid=False, color=["blue", "red"])
    fvr_div = feature_v_robocall.div(feature_v_robocall.sum(1).astype(float), axis=0)
    fvr_div.plot(kind='barh', stacked=True, color=["blue", "red"])
    return feature_v_robocall
In [27]:
labeled_data["TO_DESCRIPTION"] = labeled_data["TO_PARSED"].apply(lambda ph: geocoder.description_for_number(ph, "EN"))
labeled_data["FROM_DESCRIPTION"] = labeled_data["FROM_PARSED"].apply(lambda ph: geocoder.description_for_number(ph, "EN"))
def get_time_zone(ph):
tz = timezone.time_zones_for_number(ph)
labeled_data["TO_TIMEZONE"] = labeled_data["TO_PARSED"].apply(lambda ph: timezone.time_zones_for_number(ph))
labeled_data["FROM_TIMEZONE"] = labeled_data["FROM_PARSED"].apply(lambda ph: timezone.time_zones_for_number(ph))
In [28]:
labeled_data["FROM_TIMEZONE"].unique()
Out[28]:
In [29]:
[len(x) for x in labeled_data["FROM_TIMEZONE"].unique()]
Out[29]:
In [30]:
# For ease of plotting, I'm collapsing each timezone tuple into a single string
def get_time_zone(ph):
    # Playing fast and loose here since only one grouping had more than one timezone
    tz = timezone.time_zones_for_number(ph)
    if len(tz) > 1:
        tz = ("Etc/Lots",)
    return tz[0]
labeled_data["TO_TIMEZONE"] = labeled_data["TO_PARSED"].apply(get_time_zone)
labeled_data["FROM_TIMEZONE"] = labeled_data["FROM_PARSED"].apply(get_time_zone)
Wow, one of those timezones is pretty much unknown.
In [31]:
labeled_data[labeled_data["FROM_TIMEZONE"] == "Etc/Lots"].groupby("LIKELY ROBOCALL").aggregate(sum)
Out[31]:
In [32]:
explore_feature(labeled_data, 'FROM_TIMEZONE')
Out[32]:
That America/Dominica one looks interesting on the last plot (percentage of likely robocall by FROM_TIMEZONE) but there is only 1 data point. That "Etc/Lots" timezone is probably interesting though.
In reality, the timezone is being derived from the country code plus the area code. We should just use pandas string slicing on the area code directly.
In [33]:
# Extract the area code using slicing since they are all regular US numbers
labeled_data["TO_AREA_CODE"] = labeled_data["TO"].str.slice(1,4)
labeled_data["FROM_AREA_CODE"] = labeled_data["FROM"].str.slice(1,4)
In [34]:
labeled_data.TO_AREA_CODE.describe()
Out[34]:
In [35]:
labeled_data.FROM_AREA_CODE.describe()
Out[35]:
In [36]:
to_area_code_v_likely_robocall = explore_feature(labeled_data, "TO_AREA_CODE")
Methinks there are too many area codes to visualize that. Let's look at just the subset that is potentially interesting.
In [37]:
area_code_div = to_area_code_v_likely_robocall.div(to_area_code_v_likely_robocall.sum(1).astype(float), axis=0)
sample_size = to_area_code_v_likely_robocall.sum(1)
threshold = .20
min_samples = 10
threshold_true_robo = (sample_size > min_samples) & ((area_code_div[True] < threshold) | (area_code_div[True] > (1 - threshold)))
thresholded_area_robo = area_code_div[threshold_true_robo]
to_area_code_v_likely_robocall[threshold_true_robo].plot(kind='bar', stacked=True, grid=False, color=["blue", "red"])
thresholded_area_robo.plot(kind='barh', stacked=True, color=["blue", "red"])
thresholded_area_robo
Out[37]:
In [38]:
to_area_code_v_likely_robocall[(area_code_div[True] > (1 - threshold))]
Out[38]:
In [39]:
# Extract the office code using slicing since they are all regular US numbers
# labeled_data["TO_OFFICE_CODE"] = labeled_data["TO"].str.slice(4,7)
# labeled_data["FROM_OFFICE_CODE"] = labeled_data["FROM"].str.slice(4,7)
# Wait a second, these office codes need to be paired with their area codes. We'll have to include those.
labeled_data["TO_OFFICE_CODE"] = labeled_data["TO"].str.slice(1,7)
labeled_data["FROM_OFFICE_CODE"] = labeled_data["FROM"].str.slice(1,7)
In [40]:
# This is going to have the same (and worse) issue that exploring area code did.
# We'll create a thresholded utility function here
def explore_thresholded_feature(df, name, threshold=.20, min_samples=10):
    feature_v_robocall = pd.crosstab([df[name]], df['LIKELY ROBOCALL'])
    proportionated_feature = feature_v_robocall.div(feature_v_robocall.sum(1).astype(float), axis=0)
    sample_size = feature_v_robocall.sum(1)
    threshold_true_robo = (sample_size > min_samples) & ((proportionated_feature[True] < threshold) | (proportionated_feature[True] > (1 - threshold)))
    thresholded_feature_v_robocall = feature_v_robocall[threshold_true_robo]
    thresholded_feature_v_robocall.plot(kind='barh', stacked=True, color=["blue", "red"])
    proportionated_feature[threshold_true_robo].plot(kind='barh', stacked=True, color=["blue", "red"])
    return thresholded_feature_v_robocall
explore_thresholded_feature(labeled_data, "TO_OFFICE_CODE", threshold=.08, min_samples=25)
Out[40]:
Arg. Still not really easy to look at.
I did notice a few things though, namely that some area+office prefixes had a notably higher proportion of robocallers. A decent number of prefixes have no robocallers at all; could those be blocks already populated by real people, with no room for additional numbers?
I'm leaning towards Random Forests to classify the data; how well will they work when there aren't many samples for a given category?
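An aside worth noting (not something this notebook tunes): scikit-learn's RandomForestClassifier exposes a min_samples_leaf parameter that controls how few samples a leaf may hold, and thus how much the forest can latch onto rare categories. A minimal sketch with assumed settings:
# Assumed illustration: a larger min_samples_leaf keeps the forest from
# carving out leaves for categories with only a handful of observations
clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5)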
Let's make a different version of that thresholded function that lets you choose the direction of the threshold.
In [41]:
def explore_thresholded_feature(df, name, threshold=.20, min_samples=10, tend_toward_robocallers=True):
    feature_v_robocall = pd.crosstab([df[name]], df['LIKELY ROBOCALL'])
    proportionated_feature = feature_v_robocall.div(feature_v_robocall.sum(1).astype(float), axis=0)
    sample_size = feature_v_robocall.sum(1)
    # Seeking those with LOTS of robocallers
    threshold_true_robo = (proportionated_feature[True] > (1 - threshold))
    # Conditionally also look at those that tend not to have robocallers
    if not tend_toward_robocallers:
        threshold_true_robo |= proportionated_feature[True] < threshold
    # Limit by number of samples available
    threshold_true_robo &= (sample_size > min_samples)
    thresholded_feature_v_robocall = feature_v_robocall[threshold_true_robo]
    thresholded_feature_v_robocall.plot(kind='barh', stacked=True, color=["blue", "red"])
    proportionated_feature[threshold_true_robo].plot(kind='barh', stacked=True, color=["blue", "red"])
    return thresholded_feature_v_robocall
explore_thresholded_feature(labeled_data, "TO_OFFICE_CODE", threshold=.08, min_samples=25)
Out[41]:
That is an interesting collection. 786329 really stands out. We'll keep this as a categorical feature for our classifier.
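Since scikit-learn's trees want numeric inputs, a categorical column like TO_OFFICE_CODE will need encoding before it reaches the classifier. A minimal sketch with the LabelEncoder imported earlier (the actual encoding happens in the modeling notebook; the _ENC column name is just for illustration):
le = preprocessing.LabelEncoder()
labeled_data["TO_OFFICE_CODE_ENC"] = le.fit_transform(labeled_data["TO_OFFICE_CODE"])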
In [42]:
labeled_data["DATE/TIME"]
Out[42]:
In [43]:
# Extract Hour, Minute, and day of week
labeled_data["HOUR"] = labeled_data["DATE/TIME"].apply(lambda x: x.hour)
labeled_data["MINUTE"] = labeled_data["DATE/TIME"].apply(lambda x: x.minute)
labeled_data["DAYOFWEEK"] = labeled_data["DATE/TIME"].apply(lambda x: x.dayofweek)
In [38]:
explore_feature(labeled_data, "HOUR")
Out[38]:
In [44]:
explore_feature(labeled_data, "MINUTE")
Out[44]:
In [45]:
labeled_data["INTERVAL"] = pd.cut(labeled_data["MINUTE"], bins=range(-1,61,15), include_lowest=True)
explore_feature(labeled_data, "INTERVAL")
Out[45]:
Minutes probably need to be paired up with the hour.
In [46]:
labeled_data["TIMECHUNK"] = labeled_data["DATE/TIME"].apply(lambda x: x.hour + np.floor(4*(x.minute/60.0))/4)
In [47]:
explore_feature(labeled_data, "TIMECHUNK")
Out[47]:
Quite similar to the hour curve, but clearly more granular.
Is there a way to encode the fact that hours wrap around? Can my classifier know that this feature isn't simply bounded at 0 and 24, that there's a modulus involved?
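One common trick, not applied here, is to project a cyclic feature onto a sine/cosine pair so that 23.75 and 0.0 land close together; the _SIN/_COS column names are just for illustration:
# Assumed illustration: map TIMECHUNK (0-24, cyclic) onto the unit circle
radians = 2 * np.pi * labeled_data["TIMECHUNK"] / 24.0
labeled_data["TIMECHUNK_SIN"] = np.sin(radians)
labeled_data["TIMECHUNK_COS"] = np.cos(radians)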
In [48]:
explore_feature(labeled_data, "DAYOFWEEK")
Out[48]:
Wait! Where is 0? That's some strong sampling bias if 0 (Monday) isn't even included...
UPDATE: I spoke with the group that produced the data and they accidentally got rid of Monday. I do have more data I could work with in the future.
In [49]:
def total_call_volume(df):
    sizes = df.groupby("FROM").size()
    def get_size(val):
        # get_size reads `sizes` at call time, so rebinding it below
        # switches the lookup over to the TO counts
        return sizes[val]
    df["NUM_FROM_CALLS"] = df["FROM"].apply(get_size)
    sizes = df.groupby("TO").size()
    df["NUM_TO_CALLS"] = df["TO"].apply(get_size)
total_call_volume(labeled_data)
total_call_volume(unlabeled_data)
In [50]:
explore_feature(labeled_data, "NUM_FROM_CALLS")
Out[50]:
At this point, we've explored a few features and can build a model from them with some simple tools. Let's use those features to build a simple Random Forest classifier in Modeling Rachel the Robocaller.